Microprocessor in which multiple instructions are executed in one clock cycle by providing separate machine bus access to a register file for different types of instructions

ABSTRACT

A microprocessor having a memory coprocessor (10) connected to a MEM interface (16) and a register coprocessor (12) connected to a REG interface (14). The REG interface (14) and MEM interface (16) are connected to independent read and write ports of a register file (6). An Instruction Sequencer (7) is also connected to an independent write port of the register file, to the REG interface and to the MEM interface. An Instruction Cache (9) supplies the Instruction Sequencer (7) with at least two instruction words per clock. Single-cycle coprocessors (4) and multiple-cycle coprocessors (2) are connected to the REG interface (14). An Address Generation Unit (3) is connected to the MEM interface (16) for executing load-effective-address instructions and address computations for loads and stores to thereby perform effective address calculations in parallel with instruction execution by the single-cycle coprocessor. The Instruction Sequencer (7) decodes incoming instruction words from the Cache, and issues up to three instructions on the REG interface (14), the MEM interface (16), and/or the branch logic within the Instruction Sequencer. The Instruction Sequencer includes means for detecting dependencies between the instructions to thereby prevent collisions between instructions. A local register cache (5) is provided connected to the MEM interface. The local register cache maintains a stack of multiple-word local register sets, such that on each call the local registers are transferred from the register file (6) to the Local Register Cache (5) to thereby allocate the local registers in the register file for the called procedure and on a return the words are transferred back into the register file to the calling procedure.

CROSS REFERENCES TO RELATED APPLICATIONS

This application, which is assigned to Intel Corporation, is related to the following patents and applications, which are also assigned to Intel Corporation: U.S. Pat. No. 5,023,844, Ser. No. 07/486,408, filed on Feb. 28, 1990, granted to Arnold et al. on Jun. 11, 1991; U.S. Pat. No. 5,185,872, Ser. No. 07/486,407, filed on Feb. 28, 1990, granted to Arnold et al. on Feb. 9, 1993; U.S. Pat. No. 5,222,244, Ser. No. 07/630,497, filed on Dec. 20, 1990, granted to Carbine et al. on Jun. 22, 1993; and copending patent applications "Data Bypass Structure in a Microprocessor Register File to Ensure Data Integrity", Ser. No. 07/488,254, filed Mar. 5, 1990; "An Instruction Decoder That Issues Multiple Instructions in Accordance with Interdependencies of the Instructions" Ser. No. 07/630,536, filed on Dec. 20, 1990; "An Instruction Pipeline Sequencer With a Branch Lookahead and Branch Prediction Capability That Minimizes Pipeline Break Losses" Ser. No. 07/630,535, filed on Dec. 20, 1990; "Instruction Fetch Unit in a Microprocessor That Executes Multiple Instructions in One Cycle and Switches Program Streams Every Cycle" Ser. No. 07/630,498, filed on Dec. 20, 1990; "A Pipeline Sequencer With Alternate IP Selection when a Branch Lookahead Prediction Fails" Ser. No. 07/686,479 filed on Apr. 17, 1991; and, "A High Bandwidth Output Hierarchical Memory Store Including a Cache, Fetch Buffer and ROM" Ser. No. 07/630,534, filed on Dec. 20, 1990.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to data processing systems and more particularly to a superscalar pipelined microprocessor and a method and apparatus therein for causing multiple functions to be performed during each pipeline stage.

2. Description of the Related Art

Users of modern computers are demanding greater speed in the form of increased throughput (number of completed tasks per unit of time) and increased speed (reduced time it takes to complete a task). The Reduced Instruction Set Computer (RISC) architecture is one approach system designers have taken to achieve this. While there is no standard definition for the term Reduced Instruction Set Computer (RISC), there are some generally accepted characteristics of a RISC machine. Generally a RISC machine can issue and execute an instruction per clock cycle. In a RISC machine only a very few instructions can access memory, so most instructions use on-chip registers. So, a further RISC characteristic is the provision of a large number of registers on chip. In a RISC machine the user can specify in a single instruction two sources and a destination.

In U.S. Pat. No. 4,891,743 "Register Scorboarding on a Microprocessor chip" by David Budde, et al., granted on Jan. 2, 1990 and assigned to Intel Corporation, there is described apparatus for minimizing idle time when executing an instruction stream in a pipelined microprocessor by using a scoreboarding technique. A microinstruction is placed on a microinstruction bus and a microinstruction valid line is asserted. When a load microinstruction is decoded, a read operation is sent to a bus control logic, the destination register is marked as busy, and execution proceeds to the next current microinstruction. The marking provides an indication as to whether a current instruction can be executed without interfering with the completion of a previous instruction. The marking of registers gives rise to the term "scoreboarding". Execution of the current microinstruction proceeds provided that its source and destination registers are not marked "busy"; otherwise the microinstruction valid line is unasserted immediately after the current microinstruction appears on the microinstruction bus. The current microinstruction is thereby canceled and must be reissued. When data is returned as the result of a read operation, the destination registers are marked as "not busy".

The above-referenced copending patent application Ser. No. 07/486,407 extends this prior scoreboarding technique to encompass all multiple cycle operations in addition to the load instruction. This is accomplished by providing means for driving a Scbok line to signal that a current microinstruction on a microinstruction bus is valid. Information is then driven on the machine bus during the first phase of a clock cycle. The source operands needed by the instruction are read during the second phase of the clock cycle. The resources needed by operands to execute the instruction are checked to see if they are not busy. The Scbok signal is asserted upon the condition that any one resource needed by the instruction is busy. Means are provided to cause all resources to cancel any work done with respect to executing the instruction to thereby make it appear to the rest of the system that the instruction never was issued. The instruction is then reissued during the next clock cycle.

The above-referenced copending patent applications Ser. No. 07/486,408 and Ser. No. 07/488,254 describe a random access (RAM) register file having multiple independent read ports and multiple independent write ports that provide the on-chip registers to support multiple parallel instruction execution. It also checks and maintains the register scoreboarding logic as described in Ser. No. 07/486,407. The register file contains the macrocode and microcode visible RAM registers. The register file provides a high performance interface to these registers through a multi-ported access structure, allowing four reads and two writes on different registers to occur during the same machine cycle. This register file provides a structure that allows multiple parallel accesses to operands, which allows several operations to proceed in parallel.

To take full advantage, a processor should be organized so that it can execute code from an internal instruction cache while having the ability to add application specific modules to meet different user applications. It should be able to execute multiple instructions in one clock cycle even when doing loads and branches.

It is therefore an object of the invention to provide a microprocessor in which multiple instructions are executed in one clock cycle.

SUMMARY OF THE INVENTION

Briefly, the above object is accomplished in accordance with the invention by providing a microprocessor having a memory coprocessor (10) connected to a MEM interface (16) and a register coprocessor (12) connected to a REG interface (14). A register file (6) is provided having a first independent read port, a second independent read port, a third independent read port, a fourth independent read port, a first independent write port and a second independent write port. The REG interface (14) is connected to the first and second independent read ports and the first independent write port. The MEM interface (16) is connected to the third and fourth independent read ports and the second independent write port. An Instruction Sequencer (7) is connected to the REG interface and to the MEM interface.

An Instruction Cache (9) supplies the Instruction Sequencer (7) with at least three instruction words per clock. Single-cycle coprocessors (4) and multiple-cycle coprocessors (2) are connected to the REG interface (14). An Address Generation Unit (3) is connected to the MEM interface (16) for executing load-effective-address instructions and address computations for loads and stores to thereby perform effective address calculations in parallel with instruction execution by the single-cycle coprocessor.

The Instruction Sequencer (7) decodes incoming instruction words from the Cache, and issues up to three instructions on the REG interface (14), the MEM interface (16), and/or the branch logic within the Instruction Sequencer. The Instruction Sequencer includes means for detecting dependencies between the instructions being issued to thereby prevent collisions between instructions.

In accordance with an aspect of the invention, a local register cache (5) is provided connected to the MEM interface. The local register cache maintains a stack of multiple-word local register sets, such that on each call the local registers are transferred from the register file (6) to the Local Register Cache (5) to thereby allocate the local registers in the register file for the called procedure and on a return the words are transferred back into the register file to the calling procedure.

A method of operation of a five pipe-stage pipelined microprocessor is taught. During the first pipe stage the Instruction Sequencer accesses said instruction cache and transfers from said I-Cache to said Instruction Sequencer three or four instruction words depending on whether the instruction pointer (IP) points to an even or odd word address.

During a second pipe stage, the Instruction Sequencer decodes instructions and checks for dependencies between the issuing instructions. It then issues up to three instructions on the three execution portions of the machine (the REG interface, the MEM interface, and the branch logic within the Instruction Sequencer), issuing only the instructions that can be executed. The sources for all the issued operations are read from the register file during the second pipe stage and sent out to the respective units to use.

During a third pipe stage, the results of the ALU/LDA operations performed by the EU (4) and/or the AGU (3) are returned to the register file, which writes the results into the destination registers of the register file.

In accordance with an aspect of the invention, during the third pipe stage, the address is issued on the external address bus for loads and stores that go off-chip.

During a fourth pipe stage data is placed on the external data bus.

During a fifth pipe stage the bus controller returns the data to the register file.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features, and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention as illustrated in the accompanying drawings.

FIG. 1 is a functional block diagram of each of the major components of the microprocessor in which the invention is embodied;

FIG. 2 is a more detailed block diagram of the interconnections between the register file (6) of FIG. 1 and the coprocessors on the machine bus;

FIG. 3 is a timing diagram of a three stage pipeline for the basic ALU/LDA-type operations, a three, four, or five stage pipeline for loads, and a two stage pipeline for branches;

FIG. 4 is a timing diagram of the basic three-stage pipeline through which most instructions flow;

FIG. 5 is a timing diagram of the operation of the execution unit; and,

FIGS. 6A and 6B comprise a flow chart of the method of operation of the microprocessor.

DESCRIPTION OF THE PREFERRED EMBODIMENT

U.S. Pat. No. 4,891,753 "Register Scoreboarding on a Microprocessor Chip" granted on Jan. 2, 1990 and assigned to Intel Corporation describes a microprocessor which has four basic instruction formats that must be word aligned and are 32-bits in length. The REG format instructions are the register-to-register integer or ordinal (unsigned) instructions. The MEM format instructions are the loads, stores, or address computation (LDA) instructions. The MEM format allows an optional 32-bit displacement. The CTRL format instructions are the branch instructions. The COBR format is an optimization that combines a compare and a branch in one instruction. The microprocessor in which the present invention is embodied has a 32-bit linear address space and has 32 general purpose registers. Sixteen of these registers are global and 16 are local. These 16 local registers are saved automatically on a call and restored on each return. The global registers, like the registers in more conventional microprocessors, retain their values across procedure boundaries.

As shown in FIG. 1 the microprocessor in which the present invention is embodied has seven basic units. They are:

The on-chip RAM/Stack Frame Cache (I-cache 9)

The Instruction Sequencer (IS-7)

The Register File (RF-6)

The Execution Unit (EU-4)

The Multiply/Divide Unit (MDU-2)

The Address Generation Unit (3,5)

These units are briefly described below. For more detailed information about each of these units refer to the above-identified copending applications.

Instruction Cache and ROM (I-Cache)

This unit (9) is described more fully in copending applications "An Instruction Decoder That Issues Multiple Instructions in Accordance with Interdependencies of the Instructions" Ser. No. 07/630,536 and "A High Bandwidth Output Hierarchical Memory Store Including a Cache, Fetch Buffer and ROM" Ser. No. 07/630,534. The instruction cache and ROM (9) provides the Instruction Sequencer (7) with instructions every cycle. It contains a 2-way set-associative instruction cache and a microcode ROM. The I-Cache and ROM are essentially one structure. The ROM is an always-hit portion of the cache. This allows it to share the same logic as the instruction cache, even the column lines in the array. The I-Cache is four words wide and is capable of supplying four words per clock to the Instruction Sequencer (IS-7). It consistently supplies three or four words per clock regardless of the alignment of the instruction address. The I-Cache also contains the external fetch handling logic that is used when an instruction fetch misses the I-Cache.

Instruction Sequencer (IS)

This unit (7) is described more fully in copending applications "An Instruction Decoder That Issues Multiple Instructions in Accordance with Interdependencies of the Instructions" Ser. No. 07/630,536, "An Instruction Pipeline Sequencer With a Branch Lookahead and Branch Prediction Capability That Minimizes Pipeline Break Losses" Ser. No. 07/630,535, "Instruction Fetch Unit in a Microprocessor That Executes Multiple Instructions in One Cycle and Switches Program Streams Every Cycle" Ser. No. 07/630,498, "A Pipeline Sequencer With Alternate IP Selection when a Branch Lookahead Prediction Fails" Ser. No. 07/686,479 and "An Instruction Decoder Having Multiple Alias Registers Which Provide Indirect Access in Microcode to User Operands" U.S. Pat. No. 5,222,244.

The instruction sequencer (7) decodes the incoming four instruction words from the I-Cache. It can decode and issue up to three instructions per clock but it can never issue more than four instructions in two clocks. This unit detects dependencies between the instructions and issues as many instructions as it can per clock. The IS directly executes branches. It also vectors into microcode for the few instructions that need microcode and also to handle interrupts and faults. The instruction decoder (ID) and the pipeline sequencer (PS) are parts of the Instruction Sequencer (7). The IS decodes the instruction stream and drives the decoded instructions onto the machine bus.

Register File (RF)

This unit (6) is described more fully in copending applications "Register Scoreboarding Extended to all Multiple-cycle operations in a Pipelined Microprocessor", Ser. No. 07/486,407, "Six-way Access Ported RAM Array Cell", Ser. No. 07/486,408, and "Data Bypass Structure in a Microprocessor Register File to Ensure Data Integrity", Ser. No. 07/488,254.

The RF (6) has 16 local and 16 global registers. It has a small number of scratch registers used only by microcode. It also creates the 32 literals (0-31 constants) specified by the architecture. The RF has 4 independent read ports and 2 independent write ports to support the machine parallelism. It also checks and maintains the register scoreboarding logic which is described more fully in copending application Ser. No. 07/486,407.

Execution Unit (EU)

The EU (4) performs all the simple integer and ordinal (unsigned) operations of the microprocessor in which the present invention is embodied. All operations take a single cycle. It has a 32-bit carry-look-ahead adder, a boolean logic unit, a 32-bit barrel shifter, a comparator, and condition code logic.

Multiply-Divide Unit (MDU)

The MDU (2) performs the integer/ordinal multiply, divide, remainder, and modular operations. It performs an 8-bit-per-clock multiply and a 1-bit-per-clock divide. A multiply has 4 clock throughput and 5 clock latency and a divide has 37 clock throughput and 38 clock latency.
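The relation between the per-clock rates and the quoted clock counts can be checked with a short calculation. The sketch below assumes 32-bit operands, which is an inference from the 32-bit datapath described elsewhere in this specification; the function name is illustrative only. It accounts only for the iteration clocks, so the gap between 32 divide iterations and the quoted 37 clock throughput is overhead that the text does not itemize.

```python
import math

def iteration_clocks(operand_bits: int, bits_per_clock: int) -> int:
    """Clocks needed to consume all operand bits at a fixed per-clock rate."""
    return math.ceil(operand_bits / bits_per_clock)

OPERAND_BITS = 32  # assumption: 32-bit integer/ordinal operands

# 8 bits per clock for multiply, 1 bit per clock for divide (from the text).
multiply_iterations = iteration_clocks(OPERAND_BITS, 8)
divide_iterations = iteration_clocks(OPERAND_BITS, 1)

print(multiply_iterations, divide_iterations)
```

Under this assumption the multiply iteration count matches the stated 4 clock throughput directly.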

Address Generation Unit (AGU)

The AGU (3) is used to do the effective address calculations in parallel with the integer execution unit. It performs the load-effective-address instructions (LDA) and also does the address computations for loads and stores. It has a 32-bit carry-look-ahead adder and a shifter in front of the adder to do the prescaling for the scaled index addressing modes.
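By way of illustration only, the shifter-then-adder arrangement can be modeled as follows. This is a minimal sketch, assuming the scale factor is a power of two applied as a left shift and that results wrap to 32 bits; the function and parameter names are hypothetical and the actual addressing-mode encodings are not given here.

```python
MASK32 = 0xFFFFFFFF  # 32-bit linear address space

def effective_address(base: int, index: int = 0, scale_shift: int = 0,
                      displacement: int = 0) -> int:
    """Model of a shifter in front of a 32-bit adder.

    The index is prescaled by a left shift (assumed power-of-two scale),
    then added to the base and displacement, wrapping at 32 bits.
    """
    scaled_index = (index << scale_shift) & MASK32
    return (base + scaled_index + displacement) & MASK32

# Example: base 0x1000, index 3 scaled by 4 (shift of 2), displacement 8.
print(hex(effective_address(0x1000, index=3, scale_shift=2, displacement=8)))
```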

Local Register Cache (LRC)

The LRC (5) maintains a stack of multiple 16-word local register sets. On each call the 16 local registers are transferred from the RF to the LRC. This allocates the 16 local registers in the RF for the called procedure. On a return the 16 words are transferred back into the RF to the calling procedure. The LRC uses a single ported RAM cell that is much smaller than the 6-ported RF cell. This keeps the RF small and fast so it can operate at a high frequency while allowing 8+ sets of local registers to be cached on-chip. With this LRC the call and return instructions take 4 clocks.
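The call and return behavior of the LRC can be sketched in software as a stack of 16-word sets. The model below is illustrative only: the class and method names are hypothetical, the capacity of eight sets is taken as one possible configuration, and the spill of the oldest set to an off-chip list when the cache is full is an assumption, since the overflow behavior is not described here.

```python
from collections import deque

SET_WORDS = 16  # one local register set is 16 words

class LocalRegisterCache:
    """Software model of a stack of local register sets (illustrative only)."""

    def __init__(self, capacity_sets: int = 8):
        self.capacity_sets = capacity_sets
        self.on_chip = deque()   # cached sets, newest on the right
        self.off_chip = []       # assumption: oldest sets spill to memory

    def call(self, rf_locals):
        """Save the caller's 16 locals; return a fresh set for the callee."""
        if len(rf_locals) != SET_WORDS:
            raise ValueError("a local register set must hold 16 words")
        if len(self.on_chip) == self.capacity_sets:
            self.off_chip.append(self.on_chip.popleft())
        self.on_chip.append(list(rf_locals))
        return [0] * SET_WORDS

    def ret(self):
        """Return the most recently saved set to the register file."""
        if not self.on_chip:
            # Reload the most recently spilled set (assumed behavior).
            self.on_chip.append(self.off_chip.pop())
        return self.on_chip.pop()

lrc = LocalRegisterCache()
caller_locals = list(range(16))
callee_locals = lrc.call(caller_locals)   # call: RF locals move to the LRC
restored = lrc.ret()                      # return: words move back to the RF
print(restored == caller_locals)
```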

On-Chip Coprocessors

The microprocessor in which the present invention is embodied has two very high performance interfaces--the REG interface (14) and MEM interface (16). These interfaces allow application-optimized modules to be added to tailor the system to a particular application area. The REG interface is where all the REG format instructions are executed. The EU (4) and MDU (2) described above are coprocessors (on-chip functional units) sitting on the REG interface. Other units can be added, such as a Floating Point Adder and a Floating Point Multiplier. The REG interface has two 64-bit source buses, src 1 (20) and src 2 (22), and a 64-bit destination bus (24). These buses provide a bandwidth of 528 MB/sec for source data and 264 MB/sec for result data to and from this REG interface.

One instruction per clock can be issued on this REG part of the machine. The operations can be single or multi-cycle as long as they are independently sequenced by the respective REG coprocessor (12). The coprocessors on the REG interface arbitrate among themselves if necessary to return their results. There can be multiple outstanding multi-cycle operations such as integer or floating point multiply and divide. The number outstanding is limited only by the number and nature (whether pipelined or not) of the REG coprocessors.

The MEM interface (16) is where all MEM format instructions are executed. It also connects the system to the memory subsystem. The on-chip memory subsystem can be a bus controller that connects to off-chip memory. The AGU (3) and LRC (5) mentioned above are coprocessors on the MEM interface. Other units can be added to this interface such as a TLB, a data cache, an on-chip RAM array, etc. This interface has a 32-bit address port, a 128-bit store bus, and a 128-bit load bus. This allows 528 MB/sec to be transferred each way between the core processor and the memory subsystem. One instruction per clock can be issued on this interface. The operations can be single or multi-cycle just as described above for the REG coprocessors. The coprocessors on this interface arbitrate among themselves if needed to return their results. There can also be multiple outstanding operations on this part of the machine such as multiple outstanding loads. The number of outstanding operations is constrained only by the nature of the bus controller or other on-chip coprocessors.
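The bandwidth figures quoted for the REG and MEM interfaces are mutually consistent if one bus transfer occurs per clock at a single clock rate. The sketch below assumes a 33 MHz clock; that rate is not stated in the text and is inferred here only because 264 MB/sec divided by 8 bytes per 64-bit transfer gives 33 million transfers per second. The function name is illustrative.

```python
def bus_bandwidth_mb_per_s(width_bits: int, clock_mhz: int, buses: int = 1) -> int:
    """Peak bandwidth in MB/sec, assuming one transfer per bus per clock."""
    return buses * (width_bits // 8) * clock_mhz

CLOCK_MHZ = 33  # assumption inferred from the quoted figures, not stated

reg_source = bus_bandwidth_mb_per_s(64, CLOCK_MHZ, buses=2)  # src 1 and src 2
reg_result = bus_bandwidth_mb_per_s(64, CLOCK_MHZ)           # destination bus
mem_each_way = bus_bandwidth_mb_per_s(128, CLOCK_MHZ)        # load or store bus

print(reg_source, reg_result, mem_each_way)
```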

The majority of all instructions executed use no microcode; they are directly issued like any other RISC machine. Microcode is used for a few instructions but mainly for fault, interrupt handling, and debug (trace) handling support. There are a few extra microinstructions that help speed up critical operations such as call and return and that access internal control registers, etc.

Key Microcoded Instructions

    ______________________________________
    Call/Return                               (4 clocks)
    Cmp_and_Branch                            (1 clock)
    Branch_and_Link                           (1 clock)
    Some Effective Address Computations
    Atomic operations (for semaphores, etc)
    ______________________________________

Microcoded instructions generally have no start-up overhead associated with them. They are seen in a lookahead manner by the Instruction Sequencer (7) so it can get ready for them while the current instructions are issued. Because of this, the microinstructions of a microcoded sequence show up seamlessly between the previous and subsequent non-microcoded macro instructions.

The Basic Pipeline

As FIG. 3 shows, the microprocessor in which the present invention is embodied has a three stage pipeline for the basic ALU/LDA-type operations, a three, four, or five stage pipeline for loads, and a two stage pipeline for branches.

Briefly, the pipeline operates as follows. During the first pipe stage, pipe 0, the Instruction Sequencer (7) accesses the instruction cache (9). The I-Cache returns three or four instruction words depending on whether the IP points to an even or odd word address.

During the second pipe stage, pipe 1, the Instruction Sequencer (7) decodes and issues up to three instructions on the three execution portions of the machine--the REG interface (14), the MEM interface (16), and the branch logic within the IS (7). Hardware checks for dependencies and only issues the instructions that can be executed. During this second pipe stage the RF (6) reads the sources for all the issued operations and sends them to the respective units to use. The IS also calculates the new IP now for branch operations.

During the third pipe stage, pipe 2, the EU (4) and/or the AGU (3) do the ALU/LDA operations and return the results to the RF. The RF then writes the results into the destination registers.

If the operation will take more than one cycle, the scoreboard bits are set (126) and the bus controller (10) issues the address on the external address bus for loads and stores that go off-chip (118).

During the fourth pipe stage, pipe 3, assuming zero wait states, the data returns on the external data bus to the bus controller (120).

During the fifth pipe stage, pipe 4, the bus controller (10) returns this data to the RF (122).

Wide and Concurrent Buses

Wide and concurrent buses are provided to feed the respective units without bottlenecks. The microprocessor in which the present invention is embodied can take much more advantage of wide buses than most other RISC-type machines. It has instructions to move, load, or store 64, 96, or 128-bit operands. The floating point part of the instructions has 64 and 80-bit operands. Wider buses make these operations faster and also easier to implement. The on-chip buses that are wider than 32 bits are the 128-bit wide instruction cache bus, the load bus, the store bus, and the 64-bit source 1, source 2, and result buses.

Parallel Decode and Issue

The parallel decode and issue starts with the 4 words/clock bandwidth from the I-Cache (9). A parallel decoder in the Instruction Sequencer (7) looks at this window of 3 or 4 instructions and operates on them. The IS issues the instruction in the first word. It looks ahead past the first instruction to see if the second word is a memory operation. If so, the IS issues it also. The IS looks ahead at the second through fourth words to see if any one is a branch. If so it issues the first branch it sees. The multi-ported register file allows all these operations to concurrently access the data they need.
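The selection rule just described can be sketched as a small function over a window of decoded instruction kinds. This is an illustration under stated simplifications, not the decoder itself: each word is assumed to hold one instruction tagged "REG", "MEM", or "BRANCH", dependency and scoreboard checks are omitted, and the guards that avoid issuing a second MEM operation or a second branch in the same clock are assumptions added to respect the one-instruction-per-interface limit. The function name is hypothetical.

```python
def select_issue(window):
    """Return the word positions chosen for issue from a 3- or 4-word window.

    window: list of instruction kinds, each "REG", "MEM", or "BRANCH".
    """
    if not window:
        return []
    issued = [0]  # the instruction in the first word is issued
    # Look ahead: issue the second word too if it is a memory operation.
    # Assumption: skipped when the first word already uses the MEM interface.
    if len(window) > 1 and window[1] == "MEM" and window[0] != "MEM":
        issued.append(1)
    # Look ahead at the second through fourth words for the first branch.
    # Assumption: skipped when the first word is itself a branch.
    if window[0] != "BRANCH":
        for position in range(1, min(len(window), 4)):
            if window[position] == "BRANCH":
                issued.append(position)
                break
    return issued

print(select_issue(["REG", "MEM", "BRANCH", "REG"]))
```

In this model a window holding a REG operation, a memory operation, and a branch yields three issued instructions in one clock, which is the maximum described above.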

Multiple Independent Functional Units

The three concurrent interfaces described above connect to multiple independent functional units or on-chip coprocessors. The standard functional units to perform the basic instruction set are the Execution Unit (4), the Multiply/Divide Unit (2), the Address Generation Unit (3), the Local Register Cache (5), and a bus controller (10). Others can be added to provide high performance floating point, provide more performance through caching, or to do a peripheral function.

The microprocessor in which the present invention is embodied requires all instructions to be executed in the same manner as they would if they were being executed sequentially. The system does not know how long the coprocessors connected to it will take to complete the operations they have been given to perform. One example is a load on the external bus, with its asynchronous wait states.

Parallelism is managed by using resource scoreboarding, as described in the above-identified copending application Ser. No. 07/486,407. Each instruction needs to use certain resources to execute. A resource might be a register, a particular functional unit, or even a bus. If any instruction being issued is lacking any needed resource then it must be stopped.

During the second pipe stage shown in FIG. 3, the resources are checked concurrently with the issuing and beginning of the instructions so this does not slow down the operating frequency. Each instruction is conditionally canceled and reissued depending on the resource check for that instruction. Register Scoreboarding sets the destination register or registers busy once it passes the resource check. When the result returns--whether 1 or many cycles later--the resultant register gets marked as not busy and free to use. Each multi-cycle functional unit maintains a busy signal that is used to delay a new instruction that needs to use this busy unit.
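The register portion of this check can be illustrated with a small software model. The sketch below is a simplification: it tracks only per-register busy bits, ignores functional-unit and bus busy signals, and treats a failed check as a cancel with reissue left to the caller. The class and method names are hypothetical.

```python
class Scoreboard:
    """Per-register busy bits (illustrative model, registers only)."""

    def __init__(self, num_registers: int = 32):
        self.busy = [False] * num_registers

    def try_issue(self, sources, destinations) -> bool:
        """Resource check: fail if any source or destination is busy.

        On success the destination registers are marked busy. On failure
        nothing changes, modeling a canceled instruction to be reissued.
        """
        if any(self.busy[r] for r in (*sources, *destinations)):
            return False
        for r in destinations:
            self.busy[r] = True
        return True

    def complete(self, destinations) -> None:
        """Result returned: mark the destination registers not busy."""
        for r in destinations:
            self.busy[r] = False

sb = Scoreboard()
first_ok = sb.try_issue(sources=[1, 2], destinations=[3])   # e.g. a load into r3
dependent_ok = sb.try_issue(sources=[3], destinations=[4])  # needs r3, still busy
sb.complete([3])                                            # load data returns
reissued_ok = sb.try_issue(sources=[3], destinations=[4])   # reissue succeeds
print(first_ok, dependent_ok, reissued_ok)
```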

Branch Prediction and Condition-Code Scoreboarding

Most instructions do not set the condition codes. In general, the compare instructions are the only operations that set them. This arrangement allows unrelated operations to be placed between compares and branches so the condition codes are guaranteed to be settled before a conditional branch needs to use them.

If this system did a compare and a branch one instruction at a time then the condition codes would always be settled. However, the IS does branch lookahead. This lookahead effectively causes one or two delay slots between a compare and a branch since the IS sees the branch one or two clocks earlier than if it did just one operation per clock. Sometimes there are no unrelated operations that can be placed in the delay slot between a compare and when a conditional branch is executed. The IS uses branch prediction to help hide these 1 or 2 delay slots between the compare and the branch. Rather than hang and wait for the condition codes to become valid, a guess is made as to which way to branch. The microprocessor in which the present invention is embodied has a static branch prediction bit used to determine the branch guess direction. The compiler or assembler sets the bit based on the most likely branch direction. This gives the most flexibility in choosing the branch guess algorithm, including profiling the application and setting the bits based on the profile.

Once the guess is made, the system begins actual execution at the assumed target. As long as the guess is correct, the prediction hides completely the condition code delay slots. The IS keeps track of condition code altering instructions to scoreboard the condition codes. When the condition codes have settled, the IS checks to see if the guess was correct. If the guess is wrong it will cause a one or two clock delay versus if it guessed correctly. This is a win or tie mechanism. The guess case is never worse than if no guess were made as long as the code is in the I-Cache. The incorrect instructions are canceled using the same mechanism used to handle the register and unit scoreboarding mentioned above.
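The win or tie property can be illustrated with a simple cost model. The sketch below is a deliberately reduced abstraction, assuming the code is in the I-Cache: waiting for the condition codes costs the full one or two delay slots, a correct guess costs nothing, and a wrong guess costs the same delay slots as waiting. The function name and the use of None for "no guess" are illustrative choices.

```python
def branch_delay_clocks(delay_slots: int, guess, outcome: bool) -> int:
    """Clocks lost at a conditional branch under a simple cost model.

    guess: True/False for the static prediction bit, or None to model
    waiting for the condition codes instead of guessing.
    """
    if guess is None:
        return delay_slots          # hang until the condition codes settle
    return 0 if guess == outcome else delay_slots

# Compare guessing against waiting for every combination.
for slots in (1, 2):
    for guess in (True, False):
        for outcome in (True, False):
            with_guess = branch_delay_clocks(slots, guess, outcome)
            without_guess = branch_delay_clocks(slots, None, outcome)
            print(slots, guess, outcome, with_guess, without_guess)
```

In every combination the guessed case loses no more clocks than the waiting case, which is the sense in which the mechanism is never worse than making no guess.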

The performance of the microprocessor in which the present invention is embodied depends on several factors: the on-chip I-cache size, whether or not there is an on-chip D-cache, and the external bus bandwidth and latency (wait states).

The term "coprocessor" is used herein to designate hardware that is used to perform transformations on data that is sourced/returned to the register file. A floating point unit, a DMA unit, or a DSP module are all examples of coprocessors.

The system has a 32-bit linear address space and 32 general purpose registers. 16 of these registers are global and 16 are local registers. The local registers are pushed onto an on-chip stack-frame cache on each call and popped back off on each return. This greatly reduces off-chip register saving and restoring when doing calls and returns. The global registers are like conventional registers which retain the same values across subroutine boundaries.

The instructions include integer/ordinal arithmetic operations (including multiply, divide, remainder), logical and bit manipulation operators, a rich set of conditional branch and comparison instructions, and load, store, and load-effective-address instructions. The system has a full complement of addressing modes for efficient memory addressing. All arithmetic/logical/bit operations have up to 3 register specifiers: two for sources and one for the destination.

There are seven main interfaces to the rest of the chip. They are:

MEM coprocessor interface

Execution or REG coprocessor interface

Event interface

Microcode Flag interface

Special Function Register interface

ICE Interface

Clock interface.

The Memory Coprocessor Interface

The MEM format instructions (load, store, lda, instruction fetch, etc) are executed on this interface. This interface is used to allow the system to talk to the memory subsystem. The Address Generation unit (AGU) and the Instruction Sequencer (IS) fetch logic are connected to this interface as MEM coprocessors. This interface is connected to the BCL and the on-chip RAM/stack frame cache unit. The MEM coprocessor interface could also connect to other on-chip MEM coprocessors such as a DMA, a TLB, or general data cache. It has a 32-bit memory address bus, a 128-bit store bus, and a 128-bit load bus along with the MEM part of the machine bus to control it. It can have an arbitrary number of MEM coprocessors connected to it.

The Execution or REG Coprocessor Interface

The REG format instructions are executed on this interface. The Execution Unit (EU) is connected to this interface as a REG coprocessor. This interface is used to allow the addition of other on-chip execution coprocessors such as an FPU, DSP unit, etc. The REG coprocessor interface has two 32/64-bit source buses and one 32/64-bit result or destination bus, along with the REG portion of the machine bus to control it. It is very flexible and simple, yet allows very high performance coprocessors to connect to it. It can have an arbitrary number of REG coprocessors connected to it.

The Clock Interface

This interface supplies the chip clock phases for a clock as described in Imel U.S. Pat. No. 4,816,700. The system uses the overlapped clock phases to help achieve the performance requirements but will also work with the traditional non-overlapped clock designs at a reduced operating frequency.

Instruction Flow

Most instructions flow through a simple three-stage pipeline shown in FIG. 3. During the first stage of the pipeline, pipe 0, the next instruction address is calculated and used to fetch the next instruction (INSTf1) from the instruction cache to execute. In pipe 1 the instruction is decoded and issued to the execution unit and then the source operands (OPRf1) are read and sent to the execution unit. In pipe 2 the operation is performed and the result (RES1) is returned to the register file. The hardware is segmented into three separate pieces, each roughly associated with a stage in the pipeline. Pipe 0 hardware roughly corresponds to the Instruction Sequencer (IS). Pipe 1 hardware roughly corresponds to the Register File (RF) and Pipe 2 hardware is mostly contained within the Execution Unit (EU).
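The overlap between the three stages can be visualized with a small timeline generator. This is a sketch under simplifying assumptions: exactly one instruction enters the pipeline per clock and nothing stalls. The function name and stage labels are illustrative and do not come from the specification.

```python
STAGES = ("pipe 0: fetch", "pipe 1: decode/issue/read", "pipe 2: execute/write")


def pipeline_timeline(instructions):
    """Return, for each clock, the instruction occupying each of the three stages."""
    n_clocks = len(instructions) + len(STAGES) - 1
    timeline = []
    for clock in range(n_clocks):
        row = []
        for stage in range(len(STAGES)):
            idx = clock - stage  # instruction idx entered pipe 0 at clock idx
            row.append(instructions[idx] if 0 <= idx < len(instructions) else None)
        timeline.append(row)
    return timeline


timeline = pipeline_timeline(["add", "xor", "shl"])
for clock, row in enumerate(timeline):
    print(clock, row)
```

By the third clock all three stages are busy with different instructions, which is the steady state the pipeline is designed to sustain.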

In this specification, signals follow a naming convention to help clarify the description of the pipeline. It is based on the pipeline stage and the clock phase. A control signal latched in the clock phase 2 (Ph2) portion of pipeline stage 1 has a suffix of q12, e.g. LdRegq12. The "q" is a delimiter indicating that the signal is latched or trapped and so will be constant for the phase indicated and also the following phase. The "12" indicates pipe 1 ph2. Other examples are S1Adrq11, BclGntq41, etc. If a signal is only valid during one phase (for example a precharge/discharge signal) it is suffixed with "u21", e.g. LdRamul12. The "u" delimiter indicates that this signal is only valid for one phase.
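The naming convention can be decoded mechanically. The following sketch assumes that every suffix is a delimiter ("q" or "u") followed by one pipe-stage digit and one phase digit (1 or 2); the function name and the returned dictionary layout are hypothetical.

```python
import re

# base name, delimiter (q = latched, u = valid for one phase), pipe digit, phase digit
_SIGNAL = re.compile(r"^(?P<base>[A-Za-z0-9]+?)(?P<kind>[qu])(?P<pipe>\d)(?P<phase>[12])$")


def parse_signal(name):
    """Split a signal name into base name, latch kind, pipe stage and clock phase."""
    match = _SIGNAL.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the q/u suffix convention")
    return {
        "base": match.group("base"),
        "latched": match.group("kind") == "q",
        "pipe": int(match.group("pipe")),
        "phase": int(match.group("phase")),
    }


for signal in ("LdRegq12", "S1Adrq11", "BclGntq41"):
    print(signal, parse_signal(signal))
```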

Pipeline Operation

Refer to the flow diagram of FIGS. 6A and B for a flow of operations as an instruction passes through each stage of the pipeline.

Pipe 0-Get the Instruction

Pipeline stage 0 is when the Instruction Sequencer (7) calculates the next instruction address (102). This could be a macro-instruction or micro-instruction address. It is either the next sequential address or the target of a branch. The IS uses the condition codes or the microarchitecture flag signals to tell which way to branch. If they are not yet valid when the IS sees a branch, it guesses whether to take or not take the branch based on the take branch bit in the branch instruction. Execution is begun on the path based on that guess. If the guess was wrong, the IS cancels the instructions begun on the wrong path and begins instruction fetching along the correct path.
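The branch decision described above can be sketched as two small functions. The function names, argument names, and return shapes are assumptions made for illustration; only the policy (use the condition codes if valid, otherwise follow the take branch bit, and cancel on a wrong guess) comes from the text.

```python
def choose_branch_direction(condition_valid, condition_true, take_branch_bit):
    """Return (taken, speculative) for a conditional branch seen by the sequencer."""
    if condition_valid:
        return condition_true, False   # outcome is known, no guess needed
    return take_branch_bit, True       # guess from the bit in the instruction


def resolve_guess(guess_taken, actual_taken, speculative_instructions):
    """Keep the speculatively started instructions only if the guess was right."""
    if guess_taken == actual_taken:
        return list(speculative_instructions)
    return []  # wrong path: cancel everything begun after the branch


taken, speculative = choose_branch_direction(False, None, take_branch_bit=True)
kept_on_hit = resolve_guess(taken, True, ["add", "ld"])
kept_on_miss = resolve_guess(taken, False, ["add", "ld"])
```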

The Instruction Sequencer (7) accesses (104) the instruction cache (9). The I-Cache returns (106) three or four instruction words depending on whether the IP points to an even or odd word address.

Pipe 1--Emit stage-- Issue and check all resources

During the second pipe stage, pipe 1, the Instruction Sequencer (7) decodes (108) and issues up to three instructions on the three execution portions of the machine: the REG interface (14), the MEM interface (16), and the branch logic within the IS (7). Hardware checks for dependencies (110) and only issues (112) the instructions that can be executed. During this second pipe stage the RF (6) reads (114) the sources for all the issued operations and sends them to the respective units to use. The IS also calculates the new IP for branch operations at this time.

The instructions get sent (116) to the other units by being driven on the machine bus, which consists of three parts:

1. The REG format instruction portion (add, mult, shl, etc).

2. The MEM format instruction portion (ld, st, lda, instruction fetch, etc).

3. The CTRL format portion (branches).

Each part of the machine bus goes to the units that help execute that type of instruction. For example, the Register File supplies the sources for both a store and an XOR operation, so it looks at both the REG and MEM portions of the machine bus. The CTRL portion stays within the Instruction Sequencer since it directly executes the branch operations.

When an instruction is issued several things happen. First, the information is driven on the machine bus during q11. Then during q12 the source operands are read and the resources needed to execute the instruction are checked to see if they are all available. If they are available then the scoreboard ScbOK signal is left asserted and the instruction is officially issued. If any resource needed by the instruction is busy (reserved by a previous incomplete instruction, or full because it is already working on as much as it can handle) then the ScbOK signal is deasserted by pulling it low. This tells any unit looking at that instruction to cancel any work done, thus making it appear as if it never happened. The IS will then attempt to reissue the instruction on the next clock, so the same sequence of operations will repeat.
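The issue-and-retry behavior around ScbOK can be modeled as a wired-AND: the signal stays asserted unless some needed resource pulls it low. The sketch below is illustrative only; the function names and the list-of-flags input format are assumptions, not part of the described hardware.

```python
def scb_ok(resource_busy_flags):
    """ScbOK stays asserted (True) unless any needed resource pulls it low."""
    return not any(resource_busy_flags)


def issue_with_retry(busy_by_clock):
    """Return the clock on which the instruction is officially issued, else None.

    busy_by_clock[n] lists the busy flags of the needed resources at clock n.
    On every clock where ScbOK is low the attempt is cancelled and repeated.
    """
    for clock, flags in enumerate(busy_by_clock):
        if scb_ok(flags):
            return clock
    return None


# Destination register busy at clock 0, unit full at clock 1, all free at clock 2.
issued_at = issue_with_retry([[True, False], [False, True], [False, False]])
```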

If the instruction address is not in the instruction cache when it is checked during q02, that is, if there is a cache miss, then the fetch logic issues a fetch on the MEM side of the machine bus during q11. This fetch looks just like a normal quad word load except that the destination of the fetch is the Instruction Sequencer rather than the Register File.

Pipe 2-- Computation stage and return stage.

During this stage the EU (4) and/or the AGU (3) do the ALU/LDA operations (122) and return (123) the results to the RF. The RF then writes (124) the results into the destination registers. The computation is begun and completed in one phase if it is a simple single-cycle ALU operation, the "no" path out of decision block (120). If the operation is a long one, the "yes" path out of decision block (120), it takes more than 1 clock and the result or destination registers are marked as busy by setting the scoreboard bits (126). A subsequent operation needing that specific register resource will be delayed until this long operation is completed. This is called scoreboarding the register. There is one bit per 32-bit register, called the scoreboard bit, that is used to mark the register busy when it is the destination of a long instruction. This scoreboard bit is checked during q12.

If the operation is a simple ALU type operation then the result is computed during q21 (block 122) and returned to the register file during q22 (block 123). As the data is written to the destination register the scoreboard bit is cleared, marking the register available for use by another instruction.
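Register scoreboarding as described above amounts to one busy bit per register, set when a long operation starts and cleared when its result is written. The following is a minimal sketch; the class name, method names, and the choice to check both sources and destination in a single call are assumptions made for illustration.

```python
class Scoreboard:
    """One busy bit per register, as in the scoreboarding scheme described above."""

    def __init__(self, n_regs=36):  # 16 global + 16 local + 4 scratch registers
        self.busy = [False] * n_regs

    def can_issue(self, sources, dest):
        # The check made during q12: no needed register may be marked busy.
        return not any(self.busy[reg] for reg in (*sources, dest))

    def start_long_op(self, dest):
        self.busy[dest] = True   # set the scoreboard bit (126)

    def write_result(self, dest):
        self.busy[dest] = False  # cleared as the data is written


sb = Scoreboard()
sb.start_long_op(4)                        # e.g. a load or multiply targeting register 4
blocked = sb.can_issue((4, 5), 6)          # needs register 4: must wait
unrelated = sb.can_issue((1, 2), 3)        # independent: may proceed
sb.write_result(4)
after_return = sb.can_issue((4, 5), 6)
```

An unrelated instruction can proceed while the long operation is outstanding, which is what allows overlapped execution.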

During the third pipe stage, the address is issued (128) on the external address bus for loads and stores that go off-chip.

Pipe 3

During the fourth pipe stage, pipe 3, assuming zero wait states, the data returns (130) on the external data bus to the bus controller.

Pipe 4

During the fifth pipe stage, pipe 4, the bus controller (10) sends this data to the RF (6) and the register file writes the results into the destination registers and clears the scoreboard bits (134).

Register file parallelism

The register file (6) is a 36 entry by 32-bit register file that is accessible from six ports and is organized as nine rows of four words each. It contains 16 global registers, 16 local registers, and 4 scratch registers. Two register file ports (128 bits wide) interface to the memory interface through two separate 128-bit busses that are operated at 32 MHz (512 MByte/s bandwidth each). These two ports allow LOAD data from a previous read operation and STORE data from a current write access to be processed in the register file simultaneously. Another 32-bit port allows an address or address reduction operand to be simultaneously fetched. Two more 64-bit ports allow two source operands to be simultaneously fetched and operated on by either the execution hardware or by an application-chip's REG coprocessor hardware. The final 64-bit port allows the result from the previous operation (pipelined execution) to be stored simultaneously with the current operation's source operand reads. Thus, the multi-ported register file allows one simple logic/arithmetic operation to be performed every cycle and, at the same time, it allows one memory operation (LOAD/STORE) to be performed every clock cycle. The register file is also designed to allow multiple operations to be in progress. This is a useful feature that improves performance where a result from a previously started operation is not yet available, but other unrelated operations can be executed in parallel. This is sometimes called overlapped execution and has usually only been applied to LOAD operations. In the current system, overlapped execution is applicable to all multiple-cycle operations such as multiply, divide, load, etc.
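The figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are taken from the paragraph above.

```python
ROWS, WORDS_PER_ROW = 9, 4
entries = ROWS * WORDS_PER_ROW             # nine rows of four words
registers = 16 + 16 + 4                    # global + local + scratch

PORT_WIDTH_BITS = 128
CLOCK_HZ = 32_000_000                      # 32 MHz
bytes_per_second = (PORT_WIDTH_BITS // 8) * CLOCK_HZ   # per 128-bit port

print(entries, registers, bytes_per_second // 1_000_000, "MByte/s")
```

Each 128-bit port moves 16 bytes per clock, which at 32 MHz gives the 512 MByte/s figure stated for each of the two memory-side busses.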

Instruction Sequencer parallelism

The instruction sequencer is designed to take advantage of the register file's parallelism. It looks at up to four instructions in every cycle and issues up to 3 instructions at a time. It tries to issue a simple REG format instruction (an ALU operation, for example), a MEM format instruction (ld, st, lda), and a CTRL (branch) operation to the different coprocessors every cycle if it can. The system can sustain issuing and executing two instructions per clock; thus it can sustain executing at a peak rate of 64 MIPS when operated at an internal clock rate of 32 MHz.

Simple branches and loads can be executed completely in parallel with simple REG format operations to achieve the maximum execution rate. A simple four-instruction sequence is shown below that achieves 64 MIPS provided the LOAD instruction gets data from a 2-state on-chip RAM and the branch-taken (guess-ahead) opcode bit is set.

    ______________________________________
    seq: add    g4,g5,g5  | g5 = g4 + g5;
         ld     (mem), g4 | load g4 with data from RAM
         cmpdec;          | compare and decrement loop counter
         be seq;          | branch if equal to seq:
    ______________________________________

This sequence is shown below from a pipe 0 perspective. It would execute in a tight two cycle loop.

    ______________________________________
    state 1        add ld
    state 2        cmpdec        branch
    state 3        add ld
    state 4        cmpdec        branch
    state 5        add ld
    state 6        cmpdec        branch
    ______________________________________
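The two-state schedule above can be reproduced with a simple in-order grouping rule: look at up to four instructions, issue at most one per execution portion (REG, MEM, CTRL), and stop at the first instruction whose portion is already taken. This is a sketch only. The mapping of cmpdec to the REG portion is an assumption, the loop is unrolled by hand as if every branch were predicted taken, and data dependencies are ignored.

```python
# Execution portion used by each instruction in the example loop (cmpdec -> REG is assumed).
PORT_OF = {"add": "REG", "cmpdec": "REG", "ld": "MEM", "be": "CTRL"}
CLOCK_MHZ = 32


def group_into_states(program, window=4):
    """Group an instruction stream into per-clock issue groups, in program order."""
    states = []
    i = 0
    while i < len(program):
        used_ports = set()
        group = []
        for op in program[i:i + window]:
            port = PORT_OF[op]
            if port in used_ports:
                break  # in-order issue: stop at the first port conflict
            used_ports.add(port)
            group.append(op)
        states.append(group)
        i += len(group)
    return states


loop = ["add", "ld", "cmpdec", "be"]
program = loop * 3                      # three iterations, branch predicted taken
states = group_into_states(program)
mips = len(program) / len(states) * CLOCK_MHZ

for number, group in enumerate(states, start=1):
    print("state", number, group)
```

Under these assumptions each loop iteration takes two states, so twelve instructions complete in six clocks, matching the sustained rate of two instructions per clock.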

This sequence of instructions takes advantage of several performance optimization features: register scoreboarding on the LOAD, instruction look ahead and parallel instruction execution, and finally branch prediction. In the case of an incorrect branch prediction, 1 to 2 instructions are cancelled and the correct sequence is started.

The instruction sequencer can perform this look ahead whenever the first instruction it fetches is a simple REG format RISC operation (e.g. add, xor, multiply). When instruction look ahead is enabled, it also allows microcoded instructions to get a one cycle head-start in going through the internal pipeline as the current instruction is being issued. This improves the execution time of these microcoded instructions by one clock compared to execution with look ahead disabled. If the current instruction requires microcode interpretation or is a branch, then instruction look ahead is disabled.

Instruction Cache

Since the instruction sequencer has a very big appetite for instructions, the system incorporates an instruction cache that can deliver up to four instructions per clock cycle to the instruction sequencer. This cache allows inner loops of execution to occur without requiring external instruction fetches that could collide with data fetches and decrease performance. This cache also allows the pipeline depth to be kept low without impacting frequency, in the following manner. As instructions get updated into the instruction cache (e.g. during the first pass through a loop) some predecoding of the instructions occurs. This predecoding expands the instruction width from 32 bits to 35 bits when stored in the instruction cache. When read out of the cache, these extra bits allow several instruction look ahead decisions to be made without associated decoding delays. The integrated cache is consolidated with the microcode ROM since partially decoded macrocode is identical in format to internal microinstructions.

Simple Branch example

A simple instruction pointer relative branch that hits the internal cache causes only a one cycle pipeline break for a simple code sequence. The branch then takes an effective 2 clocks. This is the worst case branch time if the instructions are in the instruction cache. This analysis assumes that the instruction sequencer is prevented from performing instruction look ahead on the branch. In general, the execution time of all types of branches is improved by one or two cycles whenever the instruction sequencer can look ahead and do the branch in parallel with previous operations.

Register Bypassing

Register bypassing, also known as result forwarding, is described in the above-referenced application Ser. No. 07/488,254. Consider an AND instruction that requires register g4 as an input operand, which must be fetched in state 5; however, the contents of register g4 are updated by the previous instruction (xor) at the very end of state 5. Thus, it is not possible at the highest operating frequency to obtain the updated version of g4 from the register file in time to satisfy the operand fetching requirements of the succeeding AND instruction. To alleviate this problem and to ensure that RISC-like operations can proceed at the rate of one per cycle, bypass multiplexors are built into the Register File unit. These bypass muxes forward the previous results on the return busses (Load return and REG format instruction return) onto the source busses to the execution hardware so that the bypassed data can be operated on immediately in the next cycle. To accomplish this function, the destination-register address of each returning instruction must be compared against the next instruction's source-register addresses to see if the bypass function needs to be invoked. This bypassing is done for all cases of result returns to source reads.
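The address comparison that drives the bypass multiplexors can be sketched as follows. The function name and the representation of the return busses as a list of (destination address, value) pairs are assumptions made for illustration.

```python
def read_source(reg_file, src_addr, returning):
    """Read one source operand, forwarding from a return bus when addresses match.

    reg_file  : mapping of register address -> value currently stored
    returning : (dest_addr, value) pairs on the return busses this cycle
                (e.g. the Load return and the REG format instruction return)
    """
    for dest_addr, value in returning:
        if dest_addr == src_addr:
            return value          # bypass: use the result that is still in flight
    return reg_file[src_addr]     # no match: normal register file read


G4, G5 = 4, 5
reg_file = {G4: 0x0F, G5: 0xF0}   # g4 still holds the stale value
xor_return = [(G4, 0xFF)]         # the xor result is returning to g4 this cycle

and_src1 = read_source(reg_file, G4, xor_return)   # forwarded
and_src2 = read_source(reg_file, G5, xor_return)   # read normally
and_result = and_src1 & and_src2
```

The AND instruction sees the updated g4 value even though the register file itself has not yet been written.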

Complex Instructions

Finally, any complex instructions that require a microcode sequence to be executed to complete the function are also detected early, and the latency of initiating their control flow (microcode) is overlapped with the execution of previous instructions. Due to the RISC-like, fixed instruction length, and orthogonal attributes of the instruction-set encodings, the system can incorporate the instruction look ahead functions quite cheaply. The microcoded instructions include CALL, RETURN, and some of the complex addressing modes. The optimized look ahead for these complex microcoded instructions saves many cycles.

FIG. 2 is a diagram showing all the interconnections from the execution unit (EU) to the rest of the system. Below is a general description of all signals connecting the EU with the rest of the system.

Data Buses

There are 3 data buses on the coprocessor side of the microprocessor--the source 1 bus (Src1H/Src1--64 bits), the source 2 bus (Src2H/Src2--64 bits), and the destination bus (Dsthi/Dstlo--64 bits). All coprocessors receive operands from the Register File (RF) or SFR's only and return results to the RF or SFR's only. Source1/Source2 are the input buses which drive data from the RF to all the coprocessors.

Destination (DST) is the precharged bus used by the coprocessors to return results to the RF. All coprocessors hook to these buses; however, the EU in most cases only uses the lower 32 bits of these three buses. Only in the "movl1" instruction does the EU use as input the high 32 bits of Source1. Only in the "movl1" and "mov-add-64" instructions does it drive the high 32 bits of the Destination bus.

Address Buses

All coprocessors hook to two address buses, Dstadrout (7 bits) and Dstadrin (7 bits). The instruction sequencer (IS) broadcasts both the opcode and the destination operand address to all coprocessors simultaneously. The destination operand address is broadcast on the Dstadrout bus. The coprocessor latches this address, executes the instruction, and before returning the result on the destination bus drives the Dstadrin bus with this same address. The Dstadrin bus is a precharged bus which the RF latches and decodes for the destination operand's address.
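The destination-address handshake can be sketched as a coprocessor that latches the address broadcast on Dstadrout and later presents the same address on Dstadrin together with its result. The class, its method names, and the dictionary used as a register file are hypothetical; only the latch-and-echo behavior is taken from the text.

```python
ADDR_MASK = 0x7F  # both address buses are 7 bits wide


class Coprocessor:
    """Toy REG coprocessor that echoes its latched destination address on return."""

    def __init__(self, compute):
        self.compute = compute
        self.latched_dst = None
        self.result = None

    def accept(self, dstadrout, src1, src2):
        # Latch the destination address broadcast by the instruction sequencer.
        self.latched_dst = dstadrout & ADDR_MASK
        self.result = self.compute(src1, src2)

    def return_result(self):
        # Drive Dstadrin with the latched address, then the result on the Dst bus.
        return self.latched_dst, self.result


register_file = {}
adder = Coprocessor(lambda a, b: a + b)
adder.accept(dstadrout=5, src1=10, src2=32)
dstadrin, dst = adder.return_result()
register_file[dstadrin] = dst   # the RF decodes Dstadrin to select the register
```

Because the coprocessor carries the address with it, the register file does not need to remember which destination belongs to which outstanding operation.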

Along with the Dstadrin bus there is a single line, Wr64bit. This signal is driven by coprocessors to the RF when returning a 64 bit value (instead of a 32 bit value). The EU drives this line only when executing either a "mov1" or "mov-add-64" instruction. This signal is also a precharged signal.

The Wr64bit is not broadcast with the Dstadrout. It is determined solely from the opcode. Thus, it follows that the register file must also be able to detect all instructions which return 64 bit values so that appropriate scoreboard bits may be set.

Opcode (and OpcodeL)

The opcodes for instructions can be up to 12 bits long. Of those, 8 bits represent opcodes in one of the four instruction formats: REG, MEM, CPBR, CTRL. The coprocessors only execute the REG format instructions, which represent one quarter of the opcode space. Thus, of these 8 bits the instruction sequencer only broadcasts 6 bits to the coprocessors on the "opcode" bus; the REG format instruction type on this bus is implied. Four other bits further decode instructions within the REG format space. They are broadcast on the "opcodel" lines. Both the "opcode" bus and the "opcodel" bus are precharged buses.

Scbok

This signal is both an input and an output signal to the EU.

In pipestage 1, phase 2, Scbok is an input as far as the EU is concerned. If it is pulled low at this time it indicates that either a resource that the EU needs is not free (i.e. a register to be used as destination) or that another single-cycle coprocessor has faulted or needs an assist. In either case the EU does not execute its instruction.

In pipestage 2, phase 2, Scbok is an output as far as the EU is concerned. Scbok is pulled low by the EU in case of an EU fault or event, but the current operation the EU is performing continues to completion. Pulling Scbok low at this stage stops execution of the next instruction in the pipe and allows the instruction sequencer to start execution of the fault or event handler.

Cceuidq12 and Cceuidq22

Cceuidq12 is a 3 bit bus on which the IS sends the condition codes (CCC) to the EU during pipe 1, ph2. The EU is the only unit which can modify the CCC. It does so (if necessary) during pipe 2, ph1 and returns the modified CCC to the IS during the following phase 2--pipe 2, ph2. Cceuidq22 is the 3 bit bus on which the CCC is returned.

While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and detail may be made therein without departing from the scope of the invention.

What is claimed is:
 1. A microprocessor comprising: a REG interface (14); a MEM interface (16); a macro bus (11); an instruction cache (9) connected to this macro bus capable of supplying multiple instructions on said macro bus during a single clock cycle; an instruction sequencer (7) connected to said macro bus, said REG interface, said MEM interface; said instruction sequencer including an instruction decoder and a branch logic; a multiported register file (6) connected to said instruction sequencer (7) capable of reading from multiple sources of instructions during a single clock cycle; said instruction decoder within said instruction sequencer (7) being capable of decoding multiple instructions and issuing multiple instructions during a single clock cycle on said REG interface, said MEM interface and said branch logic; first coprocessors (2,4,12) connected in parallel to said REG interface for receiving first instructions having a first format from said register file and for executing said instructions in parallel during a single clock cycle; and, second coprocessors (3,5,10) connected in parallel to said MEM interface for receiving second instructions having a second format from said register file and for executing said second instructions in parallel during a single clock cycle.
 2. The combination in accordance with claim 1 further comprising: a local register cache (5) connected to said MEM interface for maintaining a stack of multiple-word local register sets, such that on each call the local registers are transferred from said register file (6) to said Local Register Cache (5) to thereby allocate said local registers in the register file for the called procedure and on a return said words are transferred back into the register file to the calling procedure.
 3. The combination in accordance with claim 2 wherein said single-cycle coprocessor (4) is an integer execution unit (4); said execution unit being capable of executing integer arithmetic operations in a single cycle.
 4. A microprocessor comprising: a REG interface (14); a MEM interface (16); a memory coprocessor (10) connected to said MEM interface (16); a register coprocessor (12) connected to said REG interface (14); an Instruction Sequencer (7); a Register File (6) having a first independent read port, a second independent read port, a third independent read port, a fourth independent read port, a first independent write port and a second independent write port; said REG interface (14) connected to said first and second independent read ports and said first independent write port; said MEM interface (16) connected to said third and fourth independent read ports and said second independent write port; said Instruction Sequencer (7) connected to said REG interface and to said MEM interface; an Instruction Cache (9) connected to said instruction sequencer and to said MEM interface for providing said Instruction Sequencer with instructions every cycle; said Cache (9) being multiple words wide and capable of supplying at least two words per clock to said Instruction Sequencer (7); a single-cycle coprocessor (4) connected to said REG interface (14); a multiple-cycle coprocessor (2) connected to said REG interface (14); said Multiply-Divide Unit including means for executing multiply and divide arithmetic operations requiring multiple cycles; and, an Address Generation Unit (3) connected to said MEM interface (16); said Address Generation Unit including means for executing load-effective-address instructions and address computations for loads and stores to thereby perform effective address calculations in parallel with instruction execution by said integer execution unit; said Instruction Sequencer (7) including means for decoding said incoming instruction words from said Cache, and issuing up to three instructions each on one of said REG interface (14), said MEM interface (16), and said branch logic within said Instruction Sequencer, including means for detecting dependencies between the instructions to thereby prevent collisions between instructions.
 5. The combination in accordance with claim 4 further comprising: a local register cache (5) connected to said MEM interface for maintaining a stack of multiple-word local register sets, such that on each call the local registers are transferred from said register file (6) to said Local Register Cache (5) to thereby allocate said local registers in the register file for the called procedure and on a return said words are transferred back into the register file to the calling procedure.
 6. The combination in accordance with claim 4 wherein said single-cycle coprocessor (4) is an integer execution unit (4); said execution unit being capable of executing integer arithmetic operations in a single cycle.
 7. The combination in accordance with claim 4 wherein said multiple-cycle coprocessor is a multiply-divide unit (2); said Multiply-Divide Unit being capable of executing multiply and divide arithmetic operations requiring multiple cycles.
 8. The combination in accordance with claim 4 wherein said multiple-cycle coprocessor is a multiply-divide unit (2); said Multiply-Divide Unit being capable of executing multiply and divide arithmetic operations requiring multiple cycles.
 9. In a five pipe-stage pipelined microprocessor which includes an instruction cache (9), an Instruction Sequencer (7) including branch logic, instruction pointer (IP), a REG interface (14), a MEM interface (16), register file (6) of registers including destinations registers, the method comprising the steps of: (A) accessing (104) said instruction cache (9) during the first pipe stage the Instruction Sequencer (7); (B) transferring (106) from said I-Cache to said Instruction Sequencer (7) three or four instruction words depending on whether said instruction pointer (IP) points to an even or odd word address; (C) decoding (108) said instructions in said Instruction Sequencer (7) during said second pipe stage; (D) checking (110), during said second pipe stage, for dependencies between instructions; (E) issuing (112), during said second pipe stage, up to three instructions on the three execution portions of the machine, said REG interface (14), said MEM interface (16), and said branch logic within said Instruction Sequencer (7), only the instructions that can be executed; (F) reading (114) into said register file (6), during said second pipe stage, the sources for all the issued operations; (G) sending (116) out of said register file (6), during said second pipe stage, said sources for all the issued operations to the respective units to use; (H) calculating (118), during said second pipe stage, the new IP for branch operations; (I) returning (122) to said register file, during said third pipe stage, the results of doing the EU (4) and/or the AGU (3) ALU/LDA operations; and, (J) writing (124) said results into said destination registers of said register file.
 10. The method in accordance with claim 9 comprising the further steps of: (K) issuing (128), during said third pipe stage, the address on the external address bus for loads and stores that go off-chip; and, (L) placing (130) data on said external data bus during the fourth pipe stage; and, (M) returning (132) said data from said bus controller (10) to said register file during the fifth pipe stage.